[conflicted] Will prefer haven::is.labelled over any other package.
[conflicted] Will prefer dplyr::filter over any other package.
Data Management I
Agenda
dplyr- Assignment operators
- Summary Statistics
Learning objectives
By the end of the lecture, you will be able to …
- implement basic variable manipulation
- produce measures of central tendency and variability
Code-along 03
Download and open code-along-03.qmd
Packages
Load the standard packages.
library(here)
library(tidyverse)
library(haven) # not core tidyverse
library(gssr)
library(gssrdoc)
library(summarytools)Load your data & codebook
# Get the data all survey years
data(gss_all)
# Load the codebook
data(gss_dict)dplyr
Tidying Data
Most tasks related to data analysis are not glorious or fancy.
A lot of time is dedicated to whipping a dataset into the shape needed to be able to analyze it.
This task has different names “data cleaning,” “data management,” “data manipulation,” “data wrangling,” “data transformation.”
Source: Jasper Dag Tjaden
dplyr package
The dplyr package provides a complete set of functions that help you solve the most common data manipulation challenges such as:
- filtering observations based on their values
- extracting observations based on their values or positions
- sampling observations based on a specific number or fraction of rows
- sorting observations based on one or several variables
- selecting variables based on their names or positions
- renaming variables
- adding new variables based on existing ones
- summarizing observations or variables to a single descriptive measure
- performing any operation by group
Source: Antoine Soetewey
The pipe |>
The pipe operator passes what comes before it into the function that comes after it as the first argument in that function.
sum(1, 2)[1] 3
. . .
1 |>
sum(2)[1] 3
dplyr style
In data transformation pipelines, always use a
- space before
|> - line break after
|> - indent the next line of code
We’ll talk about data visualization pipes later…
function(argument)
Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
do_that(to_this, to_that, with_those)dplyr verbs (functions) will allow you to solve the vast majority of your data manipulation challenges.
Source: Dr. Mine Çetinkaya-Rundel
dplyr basics
They are organized into four groups based on what they operate on: rows, columns, groups, or tables.
The verbs all have in common:
- The first argument is always a data frame.
- The subsequent arguments typically describe which columns to operate on using the variable names (without quotes).
- The output is always a new data frame.
Source: R for Data Science
dplyr grammar
What’s the advantage of dplyr grammar? We can sequence data manipulation!
gss_all |>
select(year, sex, agekdbrn) |>
filter(year == 2022) |>
drop_na(sex, agekdbrn) |>
group_by(sex) |>
summarise(avg = mean(agekdbrn))# A tibble: 2 × 2
sex avg
<dbl+lbl> <dbl>
1 1 [male] 26.3
2 2 [female] 23.7
select(), filter(), and drop_na()
Use select() to pick specific columns from your dataset.
Use filter() to keep rows that meet a condition.
Use drop_na() to remove rows with missing (NA) values.
. . .
- 1
-
Start with the
gss_alldata frame: - 2
- Keep only the variables in the dataset that we need.
- 3
- Keep only the respondents from the 2022 survey
- 4
- Remove any observations with missing data for our key variables
group_by() and summarize()
Use group_by() to organize your data into groups based on one or more variables.
Use summarize() to compute statistics like total, mean, or median for each group.
. . .
1gss_all |>
2 select(year, sex, agekdbrn) |>
3 filter(year == 2022) |>
4 drop_na(sex, agekdbrn) |>
5 group_by(sex) |>
6 summarise(freq = n())- 1
-
Start with the
gss_alldata frame: - 2
- Keep only the variables in the dataset that we need.
- 3
- Keep only the respondents from the 2022 survey
- 4
- Remove any observations with missing data for our key variables
- 5
- Do the next steps separately for each group in the variable
- 6
- Creates a new data frame with one row for each combination of grouping variables
# A tibble: 2 × 2
sex freq
<dbl+lbl> <int>
1 1 [male] 1031
2 2 [female] 1363
dplyr() in action
Compare the average and median age at first childbirth for U.S. men and women in 2022.
. . .
mutate() in action
Use mutate() to add new columns or change existing ones.
. . .
What proportion of new parents were teenagers (e.g., under 18 years old)?
1gss_all |>
select(year, agekdbrn) |>
filter(year == 2022) |>
2 drop_na(agekdbrn) |>
3 mutate(teen_parent = (agekdbrn < 18) * 1) |>
4 summarise(proportion = mean(teen_parent))- 1
- Manipulate the dataset to variables and observations of interest
- 2
-
Remove missing or
summarise()will report NA - 3
- Returns a logical vector (TRUE, FALSE, or NA) and * 1 converts it to numeric
- 4
-
Mean of the variable, which equals the proportion of
1s
# A tibble: 1 × 1
proportion
<dbl>
1 0.0773
mutate() with case_when()
Use case_when() inside mutate() to create values based on conditions.
. . .
What proportion of new parents had their first child as teenagers, in their 20s, 30s, or after age 40?
gss_all <- gss_all |>
mutate(age_groups = case_when(
agekdbrn < 18 ~ "<18",
agekdbrn >= 18 & agekdbrn <= 29 ~ "18–29",
agekdbrn >= 30 & agekdbrn <= 39 ~ "30–39",
agekdbrn >= 40 ~ "40+",
TRUE ~ NA_character_)
)
gss_all |>
filter(year == 2022) |>
freq(age_groups, report.nas = FALSE, headings = FALSE)
Freq % % Cum.
----------- ------ -------- --------
<18 186 7.73 7.73
18–29 1704 70.82 78.55
30–39 463 19.24 97.80
40+ 53 2.20 100.00
Total 2406 100.00 100.00
Assignment operators
Let’s make a tiny data frame to use as an example:
df <- tibble(x = c(1, 2, 3, 4, 5), y = c("a", "a", "b", "c", "c"))
df# A tibble: 5 × 2
x y
<dbl> <chr>
1 1 a
2 2 a
3 3 b
4 4 c
5 5 c
Assignment operators (cont)
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
df |>
mutate(x = x * 2)# A tibble: 5 × 2
x y
<dbl> <chr>
1 2 a
2 4 a
3 6 b
4 8 c
5 10 c
df# A tibble: 5 × 2
x y
<dbl> <chr>
1 1 a
2 2 a
3 3 b
4 4 c
5 5 c
. . .
Do something and show me
Assignment operators (cont)
Suppose you run the following and then you inspect df.
Will the x variable have values 1, 2, 3, 4, 5 or 2, 4, 6, 8, 10?
. . .
Do something, save result, overwriting original
Assignment: Overwriting
Do something, save result, overwriting original
Assignment: Overwriting (cont)
Do something, save result, overwriting original when you shouldn’t
Assignment: Overwriting (cont)
Do something, save result, overwriting original
data frame
Assignment: Overwriting (cont)
Do something, save result, not overwriting original.
Do something and show me
gss_all |>
select(year, agekdbrn) |>
filter(year == 2022) |>
drop_na(agekdbrn) |>
mutate(age_groups = case_when(
agekdbrn < 18 ~ "<18",
agekdbrn >= 18 & agekdbrn <= 29 ~ "18–29",
agekdbrn >= 30 & agekdbrn <= 39 ~ "30–39",
agekdbrn >= 40 ~ "40+",
TRUE ~ NA_character_)) |>
group_by(age_groups) |>
summarise(
count = n(),
proportion = round(count / sum(count), 3)
)Summary Statistics
Median & Mode
Let’s use dplyr grammar to find the median and mode for the childs variable.
gss_all$childs <- zap_missing(gss_all$childs)
gss_all$childs <- as_factor(gss_all$childs)
gss_all$childs <- droplevels(gss_all$childs)
1gss_all |>
filter(year == 2024) |>
2 freq(childs, report.nas = FALSE) |>
3 tb()- 1
-
Use
dplyrgrammar, starting with the name of the df and a pipe - 2
-
Use the
freq()function as usual - 3
-
Add the
tb()function to turn the table into atibble
# A tibble: 9 × 4
childs freq pct pct_cum
<fct> <dbl> <dbl> <dbl>
1 0 1029 31.4 31.4
2 1 484 14.8 46.2
3 2 851 26.0 72.1
4 3 475 14.5 86.6
5 4 243 7.41 94.0
6 5 96 2.93 96.9
7 6 53 1.62 98.6
8 7 16 0.488 99.1
9 8 or more 31 0.946 100
How many children have you ever had? Please count all that were born alive at any time (including any you had from a previous marriage).
Median & Mean
Let’s use dplyr grammar to find the median and mean for the hrs1 variable.
1median(gss_all$hrs1, na.rm=TRUE)
mean(gss_all$hrs1, na.rm=TRUE)
# show me summary statistics
summary(gss_all$hrs1)- 1
-
na.rmis a logical evaluating to TRUE or FALSE indicating whether NA values should be stripped before the computation proceeds.
[1] 40
[1] 41.11279
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 37.00 40.00 41.11 48.00 89.00 32371
Question 1a. If working, full or part time: how many hours did you work last week, at all jobs?
descr()
Univariate statistics for numerical data
- 1
- Which stats to produce. Either “all” (default), “fivenum”, “common” (see Details), or a selection. See ?descr
- 2
- Don’t forget to make a tidy dataset out of freq() or descr() outputs
# A tibble: 1 × 9
variable mean sd min med max n.valid n pct.valid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 year 39.4 13.9 0 40 89 1768 1768 100
summarize()
gss_all |>
select(year, hrs1, sex) |>
filter(year == 2024) |>
drop_na(hrs1, sex) |>
group_by(as_factor(sex)) |>
summarise(
count = n(),
min = min(hrs1),
median = median(hrs1),
max = max(hrs1),
mean = round(mean(hrs1), digits = 2),
sd = sd(hrs1)
) # A tibble: 2 × 7
`as_factor(sex)` count min median max mean sd
<fct> <int> <dbl+lbl> <dbl> <dbl+lbl> <dbl> <dbl>
1 male 869 0 40 89 [89+ hours] 41.7 13.7
2 female 891 0 40 89 [89+ hours] 37.3 13.7
Think Like a Statistician
On average, in 2024, did parents with 4 or more kids work fewer hours for pay than other parents?
How do we find out?
Think Like a Statistician
Can you reproduce this table?
# A tibble: 4 × 5
childs_group count mean median sd
<chr> <int> <dbl> <dbl> <dbl>
1 1 child 642 39.2 40 13.2
2 2 children 279 40.1 40 12.8
3 3 children 437 39.4 40 14.1
4 4 or more children 396 39.6 40 15.3
Your Data Take
What’s your conclusion to our research question?